devtools::load_all() # if using the R Project downloaded from the slides
# source("utils-glm.R") # if using a standard setup
library(here)
library(tidyr) # for data manipulation
library(dplyr) # for data manipulation
library(ggplot2) # plotting
library(performance) # diagnostic
library(car) # general utilities
library(MuMIn) # model selection
We are going to work with the admission.csv dataset,
containing \(n = 400\) students applying for
admission to graduate school at UCLA. A researcher is interested in how
variables such as gre (Graduate Record Exam score),
gpa (grade point average), and rank (prestige of the undergraduate
institution) influence admission into graduate school.
The response variable, admit/don’t admit, is a binary variable.
We need to set the working directory to the root of
the course folder using setwd(). With R Projects it is
enough to open the .RProj file and the working directory
will be set correctly and automatically.
# reading data
admission <- read.csv(here("data", "admission.csv"), header=TRUE)
# first rows
head(admission)
## admit gre gpa rank
## 1 0 456 5.571 3
## 2 1 792 5.637 3
## 3 1 960 6.000 1
## 4 1 768 5.109 4
## 5 0 624 4.823 4
## 6 1 912 4.900 2
# check dataset structure
str(admission)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
## $ gre : int 456 792 960 768 624 912 672 480 648 840 ...
## $ gpa : num 5.57 5.64 6 5.11 4.82 ...
## $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
# summary statistics
summary(admission)
## admit gre gpa rank
## Min. :0.0000 Min. :264.0 Min. :4.086 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:624.0 1st Qu.:5.043 1st Qu.:2.000
## Median :0.0000 Median :696.0 Median :5.335 Median :2.000
## Mean :0.3175 Mean :705.5 Mean :5.329 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:792.0 3rd Qu.:5.637 3rd Qu.:3.000
## Max. :1.0000 Max. :960.0 Max. :6.000 Max. :4.000
It is very important that each variable is correctly interpreted by R:
- admit is a binary variable stored as integer (0 and 1)
- gre is a numerical variable stored as integer
- gpa is a numerical variable stored as a double-precision number
- rank is a numerical variable stored as integer
We could change the type of rank to factor because we
are going to use it as a categorical (maybe ordinal) variable.
admission$rankc <- factor(admission$rank, levels = 1:4, labels = 1:4)
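Converting to a factor changes how the variable enters a model: R dummy-codes it instead of treating it as a single numeric slope. A minimal sketch with made-up values shows the resulting model matrix:

```r
# Toy example (hypothetical values): how R dummy-codes a factor
rankc <- factor(c(1, 2, 3, 4, 2, 1), levels = 1:4)
# intercept plus one indicator column for each level beyond the first
model.matrix(~ rankc)
```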
We can plot the univariate distribution of each variable:
# gre and gpa
admission |>
select(gre, gpa) |>
pivot_longer(1:2) |>
ggplot(aes(x = value)) +
geom_histogram(col = "black",
fill = "lightblue") +
facet_wrap(~name, scales = "free")
admission |>
ggplot(aes(x = rank)) +
geom_bar()
admission |>
ggplot(aes(x = admit)) +
geom_bar()
Then we can cut the gpa and gre variables
into categories and plot the admissions for each bin (i.e., a
contingency table):
admission$gpa_c <- cut(admission$gpa, seq(4, 6, 0.2), labels = FALSE)
admission$gre_c <- cut(admission$gre, seq(260, 960, 50), labels=FALSE)
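To see exactly how cut() assigns bins, here is a toy sketch with made-up values on the gpa scale; with labels = FALSE it returns the integer index of the (left-open) interval each value falls into:

```r
# How cut() bins a numeric vector: intervals are (4, 4.2], (4.2, 4.4], ...
x <- c(4.1, 4.9, 5.5, 5.95)
cut(x, breaks = seq(4, 6, 0.2), labels = FALSE)
# 1 5 8 10
```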
# admission ~ gpa
admission |>
ggplot(aes(x = gpa_c, fill = factor(admit))) +
geom_bar(position = position_dodge()) +
labs(fill = "Admission") +
theme(legend.position = "bottom")
# admission ~ gre
admission |>
ggplot(aes(x = gre_c, fill = factor(admit))) +
geom_bar(position = position_dodge()) +
labs(fill = "Admission") +
theme(legend.position = "bottom")
Given that the number of admitted students is lower than the number of non-admitted ones, we can have a look at the proportion of admissions within each bin:
admission |>
group_by(gpa_c) |>
summarise(admit = mean(admit),
non_admit = 1 - admit) |>
pivot_longer(2:3) |>
ggplot(aes(x = factor(gpa_c), y = value, fill = name)) +
geom_col() +
labs(fill = "Admission",
y = "%",
x = "gpa") +
theme(legend.position = "bottom")
admission |>
group_by(gre_c) |>
summarise(admit = mean(admit),
non_admit = 1 - admit) |>
pivot_longer(2:3) |>
ggplot(aes(x = factor(gre_c), y = value, fill = name)) +
geom_col() +
labs(fill = "Admission",
y = "%",
x = "gre") +
theme(legend.position = "bottom")
Finally we can have a look at the admissions as a function of the rank of the undergrad institution:
# margin = 2 means that each column will sum to 1
prop.table(table(admission$admit, admission$rank), margin = 2)
##
## 1 2 3 4
## 0 0.4590164 0.6423841 0.7685950 0.8208955
## 1 0.5409836 0.3576159 0.2314050 0.1791045
Clearly, as the rank of the institution worsens (from 1 to 4), the proportion of admissions decreases.
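The effect of the margin argument can be checked on a toy table (made-up counts):

```r
# margin = 2 normalizes within columns, so each column sums to 1
m <- matrix(c(2, 2, 1, 3), nrow = 2)
prop.table(m, margin = 2)
colSums(prop.table(m, margin = 2))  # 1 1
```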
Now we can fit the model using glm(). Let’s start by
fitting a null model with no predictors. We choose a binomial
GLM with a logit link function.
fit0 <- glm(admit ~ 1, data = admission, family = binomial(link = "logit"))
summary(fit0)
##
## Call:
## glm(formula = admit ~ 1, family = binomial(link = "logit"), data = admission)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8741 -0.8741 -0.8741 1.5148 1.5148
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7653 0.1074 -7.125 1.04e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 499.98 on 399 degrees of freedom
## AIC: 501.98
##
## Number of Fisher Scoring iterations: 4
Then we can fit the full model by putting all predictors:
fit1 <- glm(admit ~ gre + gpa + rank, family = binomial(link = "logit"), data = admission)
summary(fit1)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = binomial(link = "logit"),
## data = admission)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5833 -0.8833 -0.6368 1.1578 2.1776
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.5794202 1.5346577 -2.984 0.00285 **
## gre 0.0019583 0.0009089 2.155 0.03119 *
## gpa 0.6999744 0.2978395 2.350 0.01876 *
## rank -0.5601415 0.1271643 -4.405 1.06e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 459.21 on 396 degrees of freedom
## AIC: 467.21
##
## Number of Fisher Scoring iterations: 4
First, we can have a look at the residuals vs. fitted
plot:
plot_resid(fit1, type = "pearson")
Given that admit is a binary variable and we are
using a Bernoulli model, we can use the binned residuals
to get a better idea:
binres <- data.frame(performance::binned_residuals(fit1, n_bins = 20))
binres |>
ggplot(aes(x = xbar, y = ybar)) +
geom_point() +
geom_line(aes(x = xbar, y = 2*se)) +
geom_line(aes(x = xbar, y = -2*se)) +
ylim(c(-0.5,0.5)) +
xlab("Binned fitted(fit)") +
ylab("Binned residuals(fit)")
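The quantity behind this plot can be reproduced by hand. A self-contained sketch on simulated data (not the admission dataset): group observations into bins of similar fitted probability, then average the raw residuals within each bin.

```r
# Hand-rolled binned residuals on simulated data
set.seed(6)
x <- rnorm(400)
y <- rbinom(400, 1, plogis(x))
m <- glm(y ~ x, family = binomial)
p <- fitted(m)
# 20 bins with roughly equal numbers of observations
bins <- cut(p, breaks = quantile(p, seq(0, 1, length.out = 21)),
            include.lowest = TRUE)
xbar <- tapply(p, bins, mean)      # mean fitted probability per bin
ybar <- tapply(y - p, bins, mean)  # mean raw residual per bin
```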
Then we can plot the residuals as a function of each predictor:
residualPlots(fit1, tests = FALSE)
Then we can check for influential observations:
infl <- infl_measure(fit1)
head(infl)
## dfb.1_ dfb.gre dfb.gpa dfb.rank dffit cov.r
## 1 0.008457105 0.06129327 -0.03203678 -0.01713304 -0.07098758 1.0204711
## 2 -0.051239465 0.02598965 0.03304288 0.05772245 0.10762922 0.9957914
## 3 -0.049701045 0.04886713 0.04057539 -0.05652968 0.10457047 1.0232419
## 4 0.027926685 0.05119312 -0.06896857 0.14432426 0.17534139 0.9858326
## 5 -0.016239562 0.00292659 0.01835764 -0.02746146 -0.03698701 1.0160704
## 6 0.082770141 0.13883221 -0.12839399 -0.01511768 0.18029392 1.0119995
## cook.d hat
## 1 0.0008130455 0.013721828
## 2 0.0031354662 0.005779840
## 3 0.0018795623 0.018544327
## 4 0.0132718402 0.009149732
## 5 0.0002092958 0.007606754
## 6 0.0079149982 0.018789033
Plotting using the car package:
car::influenceIndexPlot(fit1, vars = c("Studentized", "hat", "Cook"))
Plotting also the dfbeta:
dfbeta_plot(fit1)
Check if there are observations with high studentized residuals:
outlierTest(fit1) # Testing outliers
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 198 2.196457 0.028059 NA
For potentially influential observations we could fit a model excluding that specific observation and compare the coefficients. This is similar to the dfbeta metric, which suggests there are no observations with a large influence on the model parameters.
# Is 198 really influential?
fit1_no198 <- update(fit1, subset=-c(198))
compareCoefs(fit1, fit1_no198)
## Calls:
## 1: glm(formula = admit ~ gre + gpa + rank, family = binomial(link =
## "logit"), data = admission)
## 2: glm(formula = admit ~ gre + gpa + rank, family = binomial(link =
## "logit"), data = admission, subset = -c(198))
##
## Model 1 Model 2
## (Intercept) -4.58 -4.65
## SE 1.53 1.54
##
## gre 0.001958 0.002117
## SE 0.000909 0.000918
##
## gpa 0.700 0.701
## SE 0.298 0.299
##
## rank -0.560 -0.585
## SE 0.127 0.129
##
Firstly, we can extract model parameters, taking the exponential to interpret them as odds ratios:
broom::tidy(fit1, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0103 1.53 -2.98 0.00285 0.000480 0.199
## 2 gre 1.00 0.000909 2.15 0.0312 1.00 1.00
## 3 gpa 2.01 0.298 2.35 0.0188 1.13 3.64
## 4 rank 0.571 0.127 -4.40 0.0000106 0.443 0.729
We can interpret these parameters as follows: for a unit increase in
x, the odds of being accepted into grad school are multiplied by
exp(beta). If we multiply exp(beta) by 100 we
obtain the odds after the increase expressed as a percentage of the
original odds. Given that we have multiple parameters, when we interpret
a specific parameter we are controlling for the other parameters.
broom::tidy(fit1, exponentiate = TRUE, conf.int = TRUE) |>
slice(-1) |>
mutate(estperc = estimate * 100)
## # A tibble: 3 × 8
## term estimate std.error statistic p.value conf.low conf.high estperc
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 gre 1.00 0.000909 2.15 0.0312 1.00 1.00 100.
## 2 gpa 2.01 0.298 2.35 0.0188 1.13 3.64 201.
## 3 rank 0.571 0.127 -4.40 0.0000106 0.443 0.729 57.1
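As a quick numeric check, using a made-up coefficient of 0.7 (close to the gpa estimate):

```r
# A one-unit increase multiplies the odds by exp(beta)
beta <- 0.7
exp(beta)              # odds ratio, about 2.01
exp(beta) * 100        # new odds as a percentage of the old odds (about 201%)
(exp(beta) - 1) * 100  # percent increase in the odds (about 101%)
```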
To better interpret the parameters we need to make sure that the
scale is meaningful. For example, the gre effect seems to
be very small but statistically significant. The reason is that a unit
increase in gre is very small. We could, for example, rescale
the variable by dividing it by a constant term:
par(mfrow = c(1,2))
hist(admission$gre)
hist(admission$gre/100)
Let’s try fitting the model with the new variable:
admission$gre100 <- admission$gre/100
fit2 <- glm(admit ~ gre100 + gpa + rank, family = binomial(link = "logit"), data = admission)
summary(fit2)
##
## Call:
## glm(formula = admit ~ gre100 + gpa + rank, family = binomial(link = "logit"),
## data = admission)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5833 -0.8833 -0.6368 1.1578 2.1776
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.57942 1.53466 -2.984 0.00285 **
## gre100 0.19583 0.09089 2.155 0.03119 *
## gpa 0.69997 0.29784 2.350 0.01876 *
## rank -0.56014 0.12716 -4.405 1.06e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 459.21 on 396 degrees of freedom
## AIC: 467.21
##
## Number of Fisher Scoring iterations: 4
broom::tidy(fit2, exponentiate = TRUE, conf.int = TRUE) |>
slice(-1) |>
mutate(estperc = estimate * 100)
## # A tibble: 3 × 8
## term estimate std.error statistic p.value conf.low conf.high estperc
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 gre100 1.22 0.0909 2.15 0.0312 1.02 1.46 122.
## 2 gpa 2.01 0.298 2.35 0.0188 1.13 3.64 201.
## 3 rank 0.571 0.127 -4.40 0.0000106 0.443 0.729 57.1
Now the gre effect is more meaningful. Notice how the
overall model fit is unchanged, as are the other parameters: we
are only rescaling a variable.
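That invariance is easy to verify on simulated data (made-up effect size, not the admission dataset): dividing a predictor by 100 multiplies its coefficient by 100 and leaves the deviance untouched.

```r
# Rescaling a predictor only rescales its coefficient
set.seed(5)
x <- rnorm(100, mean = 700, sd = 100)
y <- rbinom(100, 1, plogis((x - 700) / 200))
a <- glm(y ~ x, family = binomial)
b <- glm(y ~ I(x / 100), family = binomial)
unname(coef(a)[2] * 100)             # equals the coefficient of x/100
unname(coef(b)[2])
all.equal(deviance(a), deviance(b))  # TRUE: same fit
```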
Generally we can plot the effects for a better overview of the model:
plot(effects::allEffects(fit1))
To interpret the parameters in probability terms we could use the divide-by-4 rule, which gives the maximum slope (i.e., the maximum increase in probability for a unit increase in the predictor):
coef(fit2)[-1]/4
## gre100 gpa rank
## 0.04895789 0.17499359 -0.14003536
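The rule follows from the fact that the logistic density peaks at 1/4, so the steepest possible slope of the predicted probability curve is beta/4. A one-line check (using a made-up beta close to the gre100 estimate):

```r
dlogis(0)         # 0.25: maximum of the logistic density
beta <- 0.196     # roughly the gre100 coefficient
beta * dlogis(0)  # maximum slope, matching beta/4
```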
Similarly, we can compute the marginal effects for each variable, which represent the average slope:
margins::margins(fit2) |> summary()
## factor AME SE z p lower upper
## gpa 0.1367 0.0568 2.4072 0.0161 0.0254 0.2479
## gre100 0.0382 0.0174 2.1974 0.0280 0.0041 0.0723
## rank -0.1094 0.0227 -4.8138 0.0000 -0.1539 -0.0648
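The average marginal effect can also be computed by hand: it is the average over observations of beta times the logistic density evaluated at the linear predictor. A self-contained sketch on simulated data (margins::margins should agree up to numerical differences):

```r
# Hand-rolled AME for a logistic regression
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 * x))
m <- glm(y ~ x, family = binomial)
eta <- predict(m, type = "link")
ame <- mean(dlogis(eta)) * coef(m)["x"]  # average slope of P(y = 1) wrt x
unname(ame)
```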
Beyond the model coefficients, we could use a likelihood ratio test. Let’s start by comparing the null model with the current model. We hope that our combination of predictors does a better job than the null model:
anova(fit0, fit1, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: admit ~ 1
## Model 2: admit ~ gre + gpa + rank
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 399 499.98
## 2 396 459.21 3 40.771 7.314e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
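Under the hood, the LRT statistic reported by anova() is simply the drop in deviance, compared against a chi-square with as many degrees of freedom as the extra parameters. A self-contained sketch on simulated data:

```r
# Likelihood ratio test by hand
set.seed(2)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
m0 <- glm(y ~ 1, family = binomial)
m1 <- glm(y ~ x, family = binomial)
lrt <- deviance(m0) - deviance(m1)        # deviance reduction
pchisq(lrt, df = 1, lower.tail = FALSE)   # p-value of the LRT
```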
As expected from the model summary and the deviance reduction, the variables are useful for predicting the probability of admission. How useful? We could use some \(R^2\)-like measures:
performance::r2_tjur(fit1)
## Tjur's R2
## 0.09946513
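Tjur's R2 has a simple definition: the mean fitted probability among the events minus the mean fitted probability among the non-events. A sketch on simulated data (performance::r2_tjur computes the same quantity):

```r
# Tjur's R2 by hand
set.seed(3)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(x))
m <- glm(y ~ x, family = binomial)
p <- fitted(m)
mean(p[y == 1]) - mean(p[y == 0])  # discrimination between the two groups
```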
Although useful, the model has a low \(R^2\). Furthermore, the correct classification rate is higher than chance level but still relatively low:
1 - error_rate(fit1)
## [1] 0.705
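The classification rate can be computed directly by thresholding the fitted probabilities at 0.5 (the error_rate() helper from the course utilities presumably returns one minus this). A sketch on simulated data:

```r
# Correct classification rate with a 0.5 cut-off
set.seed(4)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(2 * x))
m <- glm(y ~ x, family = binomial)
mean((fitted(m) > 0.5) == y)  # proportion of correct predictions
</imports>
```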
We could try a model comparison starting from the null model and ending with the full model:
fit2 <- update(fit2, na.action = na.fail) # required for MuMIn::dredge()
dredge(fit2)
## Global model call: glm(formula = admit ~ gre100 + gpa + rank, family = binomial(link = "logit"),
## data = admission, na.action = na.fail)
## ---
## Model selection table
## (Intrc) gpa gr100 rank df logLik AICc delta weight
## 8 -4.5790 0.7000 0.1958 -0.5601 4 -229.603 467.3 0.00 0.701
## 6 -4.3760 0.9337 -0.5822 3 -231.967 470.0 2.69 0.183
## 7 -1.3960 0.2731 -0.5537 3 -232.428 470.9 3.61 0.115
## 5 0.6366 -0.5863 2 -237.856 479.7 12.43 0.001
## 4 -6.0450 0.6806 0.2279 3 -240.054 486.2 18.86 0.000
## 3 -2.9250 0.3016 2 -242.861 489.8 22.45 0.000
## 2 -5.8860 0.9556 2 -243.484 491.0 23.69 0.000
## 1 -0.7653 1 -249.988 502.0 34.68 0.000
## Models ranked by AICc(x)
The model selection table suggests that the full model is the most appropriate, at least according to the AICc.
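AICc is the AIC with a small-sample correction: AICc = AIC + 2k(k + 1)/(n - k - 1), where k is the number of estimated parameters and n the sample size. A hand-rolled version on simulated data (MuMIn::AICc should give the same value):

```r
# AICc = AIC plus a small-sample penalty
aicc <- function(model) {
  ll <- logLik(model)
  k <- attr(ll, "df")    # number of estimated parameters
  n <- attr(ll, "nobs")  # sample size
  AIC(model) + 2 * k * (k + 1) / (n - k - 1)
}
set.seed(7)
x <- rnorm(50)
y <- rbinom(50, 1, plogis(x))
m <- glm(y ~ x, family = binomial)
aicc(m)
```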
The script has been adapted from Prof. Paolo Girardi (A.Y. 2021/2022)